- Covid and the course
- Course pages
- Course overview
- Introduction to SLV
- Some examples
- Data Wrangling
- Wrap-up
Supervised Learning and Visualization
Parts of this week’s slides may be based on materials from previous iterations of Data Analysis and Visualization courses. The authors of these materials include, but may not be limited to: Erik-Jan van Kesteren, Daniel Oberski and Peter van der Heijden.
When figures and other external sources are shown, the references are included when the origin is known.
With the exception of the first lecture, all lectures are on location. There are some rules by which we abide:
Covid related:
Procedure related:
The first lecture will be recorded because of schedule clashes.
The on-location lectures will not be recorded.
If you feel that you are stuck, and the wait for the Q&A session is too long: open a GitHub issue here.
Use a reprex to detail your issue when code is involved.
If you expect that you are going to miss some part(s) of the course, please notify me via a private MS-Teams message.
You can find all materials at the following location:
All course materials should be submitted through a pull-request from your Fork of
The structure of your submissions should follow the corresponding repo’s README. To make it simple, I will add an example for the first of each submission type.
If you are unfamiliar with GitHub, forking and/or pull-request, please study this exercise from one of my other courses. There you can find video walkthroughs that detail the process.
All three have a PhD in statistics and a ton of experience in development, data analysis and visualization.
| Week # | Focus | Teacher | Materials |
|---|---|---|---|
| 1 | Data wrangling with R | GV | R4DS ISLR |
| 2 | The grammar of graphics | GV | R4DS |
| 3 | Exploratory data analysis | GV | R4DS FIMD |
| 4 | Statistical learning: regression | MC | ISLR, TBD |
| 5 | Regression model evaluation | MC | ISLR, TBD |
| 6 | Statistical learning: classification | EJvK | ISLR, TBD |
| 7 | Classification model evaluation | EJvK | ISLR, TBD |
| 8 | Nonlinear models | MC | ISLR, TBD |
| 9 | Bagging, boosting, random forest and support vector machines | MC | ISLR, TBD |
Each week we have the following:
Twice we have:
Once we have:
We will make groups on Monday Sept 13!
| | Exploratory | Confirmatory |
|---|---|---|
| Description | EDA; unsupervised learning | One-sample t-test |
| Prediction | Supervised learning | Macro-economics |
| Explanation | Visual mining | Causal inference |
| Prescription | Personalised medicine | A/B testing |
Exploratory Data Analysis:
Describing interesting patterns: use graphs, summaries, to understand subgroups, detect anomalies, understand the data
Examples: boxplot, five-number summary, histograms, missing data plots, …
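As a minimal sketch of two of these summaries, the snippet below computes a five-number summary and applies the usual boxplot rule (1.5 times the interquartile range) to flag possible anomalies. The course itself works in R; Python's standard library is used here purely for illustration. The `hourly_wages` values and both function names are invented for this example.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    ordered = sorted(values)
    # quantiles(..., n=4) returns the three quartile cut points; the
    # "inclusive" method interpolates between the observed values.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]

def iqr_outliers(values):
    """Flag values more than 1.5 * IQR outside the quartiles (boxplot rule)."""
    _, q1, _, q3, _ = five_number_summary(values)
    margin = 1.5 * (q3 - q1)
    return [v for v in values if v < q1 - margin or v > q3 + margin]

# Hypothetical toy data, invented for illustration only.
hourly_wages = [12, 15, 15, 18, 21, 24, 30, 95]

summary = five_number_summary(hourly_wages)
outliers = iqr_outliers(hourly_wages)
print("five-number summary:", summary)
print("possible anomalies:", outliers)
```

Different quartile conventions give slightly different values for Q1 and Q3, so the numbers may not match other software exactly.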
Supervised learning:
Regression: predict continuous labels from other values.
Examples: linear regression, support vector machines, regression trees, …
Classification: predict discrete labels from other values.
Examples: logistic regression, discriminant analysis, classification trees, …
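The difference between the two tasks can be sketched with two toy models. The regression part fits a simple least-squares line. The classification part uses a nearest-centroid rule, which is not one of the methods listed above and is swapped in only because it fits in a few lines. The course works in R; this Python sketch and all data and function names in it are assumptions made for illustration.

```python
def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x for one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def fit_centroids(x, labels):
    """Compute the mean of x within each class label."""
    groups = {}
    for xi, label in zip(x, labels):
        groups.setdefault(label, []).append(xi)
    return {label: sum(vals) / len(vals) for label, vals in groups.items()}

def classify(centroids, new_x):
    """Predict the label whose class mean is closest to new_x."""
    return min(centroids, key=lambda label: abs(new_x - centroids[label]))

# Regression: the label is continuous (hypothetical toy data).
intercept, slope = fit_line([1, 2, 3, 4, 5], [3, 5, 7, 9, 11])
predicted_value = intercept + slope * 6

# Classification: the label is discrete (hypothetical toy data).
centroids = fit_centroids([1, 2, 3, 7, 8, 9], ["a", "a", "a", "b", "b", "b"])
predicted_label = classify(centroids, 4)

print(predicted_value, predicted_label)
```

In both cases the model is learned from examples with known labels and then applied to a new observation; only the type of label differs.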
How do you think that data analysis relates to:
People from different fields (such as statistics, computer science, information science, industry) have different goals and different standard approaches.
data analysis.
In this course we emphasize drawing insights that help us understand the data.
| | Exploratory | Confirmatory |
|---|---|---|
| Description | | |
| Prediction | | |
| Explanation | | |
| Prescription | | |
Source: wikimedia commons and MIMP summerschool slide 28
Challenger space shuttle - 28 Jan 1986 - 7 deaths
Challenger disaster
How wages differ
John Snow and Cholera
Election prediction
Flu trends
Brontë or Austen
Elevation, climate and forest
The tree of life
Where would you place each example in the table?
Can we think of other common questions?
Can we think of an example of a case where the model did not do well?
When high-risk decisions are at hand, it is paramount to analyze the correct data.
When thinking about important topics, such as whether to stay in school, it helps to know that more highly educated people tend to earn more, but also that there is no difference for top earners.
Before John Snow, people thought “miasma” caused cholera and they fought it by airing out the house. It was not clear whether this helped or not, but people thought it must because “miasma” theory said so.
Election polls vary randomly from day to day. Before aggregating services like Peilingwijzer, newspapers would make huge news items based on noise from opinion polls.
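A small simulation can illustrate why aggregation helps. The sketch below assumes a fixed true vote share and compares the spread of single polls with the spread of plain averages over several polls. This is only a simple average under invented settings (`TRUE_SHARE`, `SAMPLE_SIZE`, `POLLS_PER_AVERAGE`), not the method that Peilingwijzer or any other aggregator actually uses, and it is written in Python rather than the R used in the course.

```python
import random
import statistics

def simulate_poll(rng, true_share, sample_size):
    """Observed share of supporters in one simulated random-sample poll."""
    hits = sum(rng.random() < true_share for _ in range(sample_size))
    return hits / sample_size

# All settings below are hypothetical and chosen for illustration.
TRUE_SHARE = 0.20        # assumed true support for one party
SAMPLE_SIZE = 500        # respondents per poll
POLLS_PER_AVERAGE = 10   # polls combined by the aggregator
REPETITIONS = 100

rng = random.Random(2021)  # fixed seed so the run is reproducible

single_polls = [
    simulate_poll(rng, TRUE_SHARE, SAMPLE_SIZE) for _ in range(REPETITIONS)
]
poll_averages = [
    statistics.mean(
        [simulate_poll(rng, TRUE_SHARE, SAMPLE_SIZE)
         for _ in range(POLLS_PER_AVERAGE)]
    )
    for _ in range(REPETITIONS)
]

spread_single = statistics.stdev(single_polls)
spread_average = statistics.stdev(poll_averages)
print("spread of single polls:   ", round(spread_single, 4))
print("spread of 10-poll average:", round(spread_average, 4))
```

Although the true share never changes in this simulation, single polls still move by a percentage point or two from one draw to the next, while the averages move far less.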
If we know flu is coming two weeks earlier than usual, that’s just enough time to buy shots for very weak people.
If we know how ecosystems are affected by temperature change, we know how our forests will change in the coming 50-100 years due to climate change.
Scholars fight over who wrote various songs (Wilhelmus), treatises (Caesar), plays (Shakespeare), etc., with shifting arguments. By counting words, we can sometimes identify the most likely author of a text, and we can explain exactly why we think that is the correct answer.
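As a rough sketch of the word-counting idea: compute how often a few common function words occur in texts of known authorship, do the same for the disputed text, and pick the author whose profile is closest. The word list, the distance measure, and the toy texts below are all invented for illustration; real authorship studies use much longer texts and more careful statistics. The sketch is in Python, not the R used in the course.

```python
from collections import Counter

# Hypothetical choice of function words to compare.
FUNCTION_WORDS = ("the", "and", "of", "but", "upon")

def profile(text):
    """Relative frequency of each function word in the text."""
    words = text.lower().split()
    counts = Counter(words)
    return [counts[word] / len(words) for word in FUNCTION_WORDS]

def distance(p, q):
    """Sum of absolute differences between two frequency profiles."""
    return sum(abs(a - b) for a, b in zip(p, q))

def most_likely_author(disputed, known_texts):
    """Return the author whose word profile is closest to the disputed text."""
    target = profile(disputed)
    return min(
        known_texts,
        key=lambda author: distance(target, profile(known_texts[author])),
    )

# Invented toy texts; not quotations from any real author.
known_texts = {
    "author_a": "the house and the garden and the road of the town",
    "author_b": "but upon reflection she stayed but upon arrival he left",
}
disputed = "upon the hill but upon the shore but never inland"

guess = most_likely_author(disputed, known_texts)
print(guess)
```

Because the decision rests on explicit word frequencies, the per-word differences between the profiles show which words drove the attribution.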
Biologists have been constructing the tree of life based on appearance of the animal/plant. But sometimes the outward appearance corresponds by chance. DNA is a more precise method, because there is more of it, and because it is more directly linked to evolution than appearance. But there is so much of it that we need automated methods of reconstructing the tree.
The examples have in common that data analysis and the accompanying visualizations have yielded insights and solved problems that could not be solved without them.